This notebook is a practice exercise: the objective is to group participants into clusters based on their answers to a personality test. You can get the dataset from Kaggle here.
From the dataset "codebook" we can read:
This data was collected (2016-2018) through an interactive on-line personality test. The personality test was constructed with the "Big-Five Factor Markers" from the IPIP. Participants were informed that their responses would be recorded and used for research at the beginning of the test, and asked to confirm their consent at the end of the test.
The following items were presented on one page and each was rated on a five point scale using radio buttons. The order on the page was EXT1, AGR1, CSN1, EST1, OPN1, EXT2, etc. The scale was labeled 1=Disagree, 3=Neutral, 5=Agree.
| Feature | Description |
|---|---|
| EXT1 | I am the life of the party. |
| EXT2 | I don't talk a lot. |
| EXT3 | I feel comfortable around people. |
| EXT4 | I keep in the background. |
| EXT5 | I start conversations. |
| EXT6 | I have little to say. |
| EXT7 | I talk to a lot of different people at parties. |
| EXT8 | I don't like to draw attention to myself. |
| EXT9 | I don't mind being the center of attention. |
| EXT10 | I am quiet around strangers. |
| EST1 | I get stressed out easily. |
| EST2 | I am relaxed most of the time. |
| EST3 | I worry about things. |
| EST4 | I seldom feel blue. |
| EST5 | I am easily disturbed. |
| EST6 | I get upset easily. |
| EST7 | I change my mood a lot. |
| EST8 | I have frequent mood swings. |
| EST9 | I get irritated easily. |
| EST10 | I often feel blue. |
| AGR1 | I feel little concern for others. |
| AGR2 | I am interested in people. |
| AGR3 | I insult people. |
| AGR4 | I sympathize with others' feelings. |
| AGR5 | I am not interested in other people's problems. |
| AGR6 | I have a soft heart. |
| AGR7 | I am not really interested in others. |
| AGR8 | I take time out for others. |
| AGR9 | I feel others' emotions. |
| AGR10 | I make people feel at ease. |
| CSN1 | I am always prepared. |
| CSN2 | I leave my belongings around. |
| CSN3 | I pay attention to details. |
| CSN4 | I make a mess of things. |
| CSN5 | I get chores done right away. |
| CSN6 | I often forget to put things back in their proper place. |
| CSN7 | I like order. |
| CSN8 | I shirk my duties. |
| CSN9 | I follow a schedule. |
| CSN10 | I am exacting in my work. |
| OPN1 | I have a rich vocabulary. |
| OPN2 | I have difficulty understanding abstract ideas. |
| OPN3 | I have a vivid imagination. |
| OPN4 | I am not interested in abstract ideas. |
| OPN5 | I have excellent ideas. |
| OPN6 | I do not have a good imagination. |
| OPN7 | I am quick to understand things. |
| OPN8 | I use difficult words. |
| OPN9 | I spend time reflecting on things. |
| OPN10 | I am full of ideas. |
The time spent on each question is also recorded in milliseconds. These are the variables ending in _E. This was calculated by taking the time when the button for the question was clicked minus the time of the most recent other button click.
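Concretely, each `_E` value is (approximately) a difference between consecutive click timestamps. A toy reconstruction of that calculation (the `click_times_ms` input is hypothetical, not a dataset field) might look like:

```python
import numpy as np

def elapsed_per_click(click_times_ms):
    """Approximate per-question elapsed time: each click timestamp
    minus the most recent previous click, as the codebook describes."""
    t = np.asarray(click_times_ms, dtype=float)
    return np.diff(t)

elapsed_per_click([0, 1000, 2500])  # two gaps: 1000 ms and 1500 ms
```

This also hints at why negative `_E` values (which we'll meet below) can appear in the raw data: if a participant goes back and revises an earlier answer, the "most recent other click" can be later than the question's own click.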
| Feature | Description |
|---|---|
| dateload | The timestamp when the survey was started. |
| screenw | The width of the user's screen in pixels |
| screenh | The height of the user's screen in pixels |
| introelapse | The time in seconds spent on the landing / intro page |
| testelapse | The time in seconds spent on the page with the survey questions |
| endelapse | The time in seconds spent on the finalization page (where the user was asked to indicate if they had answered accurately and whether their answers could be stored and used for research. Again: this dataset only includes users who answered "Yes" to this question; users were free to answer no and could still view their results either way) |
| IPC | The number of records from the user's IP address in the dataset. For max cleanliness, only use records where this value is 1. High values can be because of shared networks (e.g. entire universities) or multiple submissions |
| country | The country, determined by technical information (NOT ASKED AS A QUESTION) |
| lat_appx_lots_of_err | approximate latitude of user. determined by technical information, THIS IS NOT VERY ACCURATE. Read the article "How an internet mapping glitch turned a random Kansas farm into a digital hell" https://splinternews.com/how-an-internet-mapping-glitch-turned-a-random-kansas-f-1793856052 to learn about the perils of relying on this information |
| long_appx_lots_of_err | approximate longitude of user |
Let's import our dependencies, load the data, and take a look at it.
from sklearn.metrics import davies_bouldin_score,calinski_harabasz_score,silhouette_score
from sklearn.preprocessing import RobustScaler, StandardScaler, LabelEncoder
from sklearn.cluster import KMeans, MiniBatchKMeans, Birch
from sklearn.mixture import GaussianMixture
from sklearn.decomposition import PCA
from sklearn.utils import resample
from matplotlib import pyplot as plt
from tqdm import tqdm
import seaborn as sb
import pandas as pd
import helpers as h
import numpy as np
import os
%matplotlib inline
We'll set the folder path, load the data, and seed all random processes to get a deterministic environment and allow reproducibility.
os.chdir(r'D:\Documents\Repos\personality-test')
DATA_FOLDER = 'data'
DATA_FILE = 'data-final.csv'
df = pd.read_csv(os.path.join(DATA_FOLDER,DATA_FILE),sep='\t')
SEED=7
h.seed_everything(SEED)
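`h.seed_everything` comes from the local `helpers` module, which isn't shown here. A typical implementation of such a helper (this is a sketch of an assumption, not necessarily what `helpers.py` contains) is:

```python
import os
import random
import numpy as np

def seed_everything(seed: int) -> None:
    """Seed Python's RNG, NumPy's RNG and the hash seed so that
    every stochastic step in the notebook is reproducible."""
    random.seed(seed)
    np.random.seed(seed)
    os.environ['PYTHONHASHSEED'] = str(seed)
```

With this in place, re-running the notebook from the top with the same `SEED` reproduces the same resamples, k-means initializations, etc.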
We'll follow the codebook's advice regarding the IPC and GPS info. We'll also drop some other metadata variables that I believe aren't useful.
df = df[df['IPC']==1]
df.drop(columns=['dateload','screenw','screenh','introelapse','lat_appx_lots_of_err','long_appx_lots_of_err','IPC','endelapse'],inplace=True)
Let's take a peek.
df.describe()
| EXT1 | EXT2 | EXT3 | EXT4 | EXT5 | EXT6 | EXT7 | EXT8 | EXT9 | EXT10 | ... | OPN2_E | OPN3_E | OPN4_E | OPN5_E | OPN6_E | OPN7_E | OPN8_E | OPN9_E | OPN10_E | testelapse | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 695704.000000 | 695704.000000 | 695704.000000 | 695704.000000 | 695704.000000 | 695704.000000 | 695704.000000 | 695704.000000 | 695704.000000 | 695704.000000 | ... | 6.957040e+05 | 6.957040e+05 | 6.957040e+05 | 6.957040e+05 | 6.957040e+05 | 6.957040e+05 | 6.957040e+05 | 6.957040e+05 | 6.957040e+05 | 6.957040e+05 |
| mean | 2.577813 | 2.826747 | 3.221982 | 3.194435 | 3.230357 | 2.414146 | 2.703163 | 3.442938 | 2.940194 | 3.591510 | ... | 1.254535e+04 | 6.691977e+03 | 8.483657e+03 | 6.140398e+03 | 7.432983e+03 | 7.915356e+03 | 5.020315e+03 | 5.712047e+03 | 4.968422e+03 | 6.502470e+02 |
| std | 1.249742 | 1.322513 | 1.215858 | 1.231347 | 1.281609 | 1.230538 | 1.388454 | 1.267326 | 1.344401 | 1.293504 | ... | 1.371348e+06 | 3.208362e+05 | 4.134311e+05 | 3.495732e+05 | 4.739586e+05 | 6.624335e+05 | 1.473045e+05 | 2.062803e+05 | 2.682814e+05 | 1.586134e+04 |
| min | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | ... | -2.152050e+05 | -4.170310e+05 | -7.446700e+04 | -7.530000e+04 | -7.125690e+06 | -5.169400e+04 | -1.700700e+04 | -9.598600e+04 | -3.594871e+06 | 1.000000e+00 |
| 25% | 1.000000 | 2.000000 | 2.000000 | 2.000000 | 2.000000 | 1.000000 | 1.000000 | 2.000000 | 2.000000 | 3.000000 | ... | 3.051000e+03 | 1.861000e+03 | 2.672000e+03 | 1.983000e+03 | 2.360000e+03 | 2.278000e+03 | 2.152000e+03 | 2.328000e+03 | 1.485000e+03 | 1.700000e+02 |
| 50% | 3.000000 | 3.000000 | 3.000000 | 3.000000 | 3.000000 | 2.000000 | 3.000000 | 4.000000 | 3.000000 | 4.000000 | ... | 4.225000e+03 | 2.738000e+03 | 3.707000e+03 | 2.833000e+03 | 3.320000e+03 | 3.195000e+03 | 3.057000e+03 | 3.251000e+03 | 2.193000e+03 | 2.200000e+02 |
| 75% | 4.000000 | 4.000000 | 4.000000 | 4.000000 | 4.000000 | 3.000000 | 4.000000 | 5.000000 | 4.000000 | 5.000000 | ... | 6.122000e+03 | 4.231000e+03 | 5.427000e+03 | 4.242000e+03 | 4.888000e+03 | 4.680000e+03 | 4.460000e+03 | 4.723000e+03 | 3.343000e+03 | 3.060000e+02 |
| max | 5.000000 | 5.000000 | 5.000000 | 5.000000 | 5.000000 | 5.000000 | 5.000000 | 5.000000 | 5.000000 | 5.000000 | ... | 1.026126e+09 | 1.244837e+08 | 2.015719e+08 | 1.626808e+08 | 2.435866e+08 | 3.891434e+08 | 7.803251e+07 | 1.138087e+08 | 9.048484e+07 | 5.372971e+06 |
8 rows × 101 columns
h.resumetable(df)
Dataset Shape: (696845, 102)
| Name | dtypes | Missing | Missing % | Uniques | Uniques % | Mean | Median | First Value | Second Value | Min Value | Max Value | Entropy | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | EXT1 | float64 | 1141 | 0.001637 | 6 | 0.000009 | 2.57781 | 3 | 4 | 3 | 0.0 | 5.000000e+00 | 2.23 |
| 1 | EXT2 | float64 | 1141 | 0.001637 | 6 | 0.000009 | 2.82675 | 3 | 1 | 5 | 0.0 | 5.000000e+00 | 2.33 |
| 2 | EXT3 | float64 | 1141 | 0.001637 | 6 | 0.000009 | 3.22198 | 3 | 5 | 3 | 0.0 | 5.000000e+00 | 2.26 |
| 3 | EXT4 | float64 | 1141 | 0.001637 | 6 | 0.000009 | 3.19443 | 3 | 2 | 4 | 0.0 | 5.000000e+00 | 2.28 |
| 4 | EXT5 | float64 | 1141 | 0.001637 | 6 | 0.000009 | 3.23036 | 3 | 5 | 3 | 0.0 | 5.000000e+00 | 2.30 |
| 5 | EXT6 | float64 | 1141 | 0.001637 | 6 | 0.000009 | 2.41415 | 2 | 1 | 3 | 0.0 | 5.000000e+00 | 2.20 |
| 6 | EXT7 | float64 | 1141 | 0.001637 | 6 | 0.000009 | 2.70316 | 3 | 5 | 2 | 0.0 | 5.000000e+00 | 2.34 |
| 7 | EXT8 | float64 | 1141 | 0.001637 | 6 | 0.000009 | 3.44294 | 4 | 2 | 5 | 0.0 | 5.000000e+00 | 2.25 |
| 8 | EXT9 | float64 | 1141 | 0.001637 | 6 | 0.000009 | 2.94019 | 3 | 4 | 1 | 0.0 | 5.000000e+00 | 2.34 |
| 9 | EXT10 | float64 | 1141 | 0.001637 | 6 | 0.000009 | 3.59151 | 4 | 1 | 5 | 0.0 | 5.000000e+00 | 2.21 |
| 10 | EST1 | float64 | 1141 | 0.001637 | 6 | 0.000009 | 3.28911 | 3 | 1 | 2 | 0.0 | 5.000000e+00 | 2.31 |
| 11 | EST2 | float64 | 1141 | 0.001637 | 6 | 0.000009 | 3.13997 | 3 | 4 | 3 | 0.0 | 5.000000e+00 | 2.28 |
| 12 | EST3 | float64 | 1141 | 0.001637 | 6 | 0.000009 | 3.85977 | 4 | 4 | 4 | 0.0 | 5.000000e+00 | 2.02 |
| 13 | EST4 | float64 | 1141 | 0.001637 | 6 | 0.000009 | 2.63759 | 3 | 2 | 1 | 0.0 | 5.000000e+00 | 2.29 |
| 14 | EST5 | float64 | 1141 | 0.001637 | 6 | 0.000009 | 2.8475 | 3 | 2 | 3 | 0.0 | 5.000000e+00 | 2.30 |
| 15 | EST6 | float64 | 1141 | 0.001637 | 6 | 0.000009 | 2.84609 | 3 | 2 | 1 | 0.0 | 5.000000e+00 | 2.34 |
| 16 | EST7 | float64 | 1141 | 0.001637 | 6 | 0.000009 | 3.05149 | 3 | 2 | 2 | 0.0 | 5.000000e+00 | 2.31 |
| 17 | EST8 | float64 | 1141 | 0.001637 | 6 | 0.000009 | 2.69076 | 3 | 2 | 1 | 0.0 | 5.000000e+00 | 2.32 |
| 18 | EST9 | float64 | 1141 | 0.001637 | 6 | 0.000009 | 3.08782 | 3 | 3 | 3 | 0.0 | 5.000000e+00 | 2.30 |
| 19 | EST10 | float64 | 1141 | 0.001637 | 6 | 0.000009 | 2.8371 | 3 | 2 | 1 | 0.0 | 5.000000e+00 | 2.33 |
| 20 | AGR1 | float64 | 1141 | 0.001637 | 6 | 0.000009 | 2.23958 | 2 | 2 | 1 | 0.0 | 5.000000e+00 | 2.13 |
| 21 | AGR2 | float64 | 1141 | 0.001637 | 6 | 0.000009 | 3.82091 | 4 | 5 | 4 | 0.0 | 5.000000e+00 | 2.05 |
| 22 | AGR3 | float64 | 1141 | 0.001637 | 6 | 0.000009 | 2.25857 | 2 | 2 | 1 | 0.0 | 5.000000e+00 | 2.14 |
| 23 | AGR4 | float64 | 1141 | 0.001637 | 6 | 0.000009 | 3.92139 | 4 | 4 | 5 | 0.0 | 5.000000e+00 | 1.98 |
| 24 | AGR5 | float64 | 1141 | 0.001637 | 6 | 0.000009 | 2.28908 | 2 | 2 | 1 | 0.0 | 5.000000e+00 | 2.13 |
| 25 | AGR6 | float64 | 1141 | 0.001637 | 6 | 0.000009 | 3.72558 | 4 | 3 | 5 | 0.0 | 5.000000e+00 | 2.14 |
| 26 | AGR7 | float64 | 1141 | 0.001637 | 6 | 0.000009 | 2.22173 | 2 | 2 | 3 | 0.0 | 5.000000e+00 | 2.08 |
| 27 | AGR8 | float64 | 1141 | 0.001637 | 6 | 0.000009 | 3.65711 | 4 | 4 | 4 | 0.0 | 5.000000e+00 | 2.07 |
| 28 | AGR9 | float64 | 1141 | 0.001637 | 6 | 0.000009 | 3.77659 | 4 | 3 | 5 | 0.0 | 5.000000e+00 | 2.07 |
| 29 | AGR10 | float64 | 1141 | 0.001637 | 6 | 0.000009 | 3.57569 | 4 | 4 | 3 | 0.0 | 5.000000e+00 | 2.08 |
| 30 | CSN1 | float64 | 1141 | 0.001637 | 6 | 0.000009 | 3.28096 | 3 | 3 | 3 | 0.0 | 5.000000e+00 | 2.21 |
| 31 | CSN2 | float64 | 1141 | 0.001637 | 6 | 0.000009 | 2.97932 | 3 | 4 | 2 | 0.0 | 5.000000e+00 | 2.35 |
| 32 | CSN3 | float64 | 1141 | 0.001637 | 6 | 0.000009 | 3.98052 | 4 | 3 | 5 | 0.0 | 5.000000e+00 | 1.91 |
| 33 | CSN4 | float64 | 1141 | 0.001637 | 6 | 0.000009 | 2.64195 | 3 | 2 | 3 | 0.0 | 5.000000e+00 | 2.28 |
| 34 | CSN5 | float64 | 1141 | 0.001637 | 6 | 0.000009 | 2.57766 | 2 | 2 | 3 | 0.0 | 5.000000e+00 | 2.28 |
| 35 | CSN6 | float64 | 1141 | 0.001637 | 6 | 0.000009 | 2.85124 | 3 | 4 | 1 | 0.0 | 5.000000e+00 | 2.34 |
| 36 | CSN7 | float64 | 1141 | 0.001637 | 6 | 0.000009 | 3.69992 | 4 | 4 | 3 | 0.0 | 5.000000e+00 | 2.08 |
| 37 | CSN8 | float64 | 1141 | 0.001637 | 6 | 0.000009 | 2.47637 | 2 | 2 | 3 | 0.0 | 5.000000e+00 | 2.18 |
| 38 | CSN9 | float64 | 1141 | 0.001637 | 6 | 0.000009 | 3.13918 | 3 | 4 | 5 | 0.0 | 5.000000e+00 | 2.30 |
| 39 | CSN10 | float64 | 1141 | 0.001637 | 6 | 0.000009 | 3.59561 | 4 | 4 | 3 | 0.0 | 5.000000e+00 | 2.04 |
| 40 | OPN1 | float64 | 1141 | 0.001637 | 6 | 0.000009 | 3.73453 | 4 | 5 | 1 | 0.0 | 5.000000e+00 | 2.08 |
| 41 | OPN2 | float64 | 1141 | 0.001637 | 6 | 0.000009 | 2.02634 | 2 | 1 | 2 | 0.0 | 5.000000e+00 | 1.99 |
| 42 | OPN3 | float64 | 1141 | 0.001637 | 6 | 0.000009 | 4.03504 | 4 | 4 | 4 | 0.0 | 5.000000e+00 | 1.91 |
| 43 | OPN4 | float64 | 1141 | 0.001637 | 6 | 0.000009 | 1.95589 | 2 | 1 | 2 | 0.0 | 5.000000e+00 | 1.94 |
| 44 | OPN5 | float64 | 1141 | 0.001637 | 6 | 0.000009 | 3.80586 | 4 | 4 | 3 | 0.0 | 5.000000e+00 | 1.92 |
| 45 | OPN6 | float64 | 1141 | 0.001637 | 6 | 0.000009 | 1.87126 | 2 | 1 | 1 | 0.0 | 5.000000e+00 | 1.88 |
| 46 | OPN7 | float64 | 1141 | 0.001637 | 6 | 0.000009 | 4.01744 | 4 | 5 | 4 | 0.0 | 5.000000e+00 | 1.85 |
| 47 | OPN8 | float64 | 1141 | 0.001637 | 6 | 0.000009 | 3.24774 | 3 | 3 | 2 | 0.0 | 5.000000e+00 | 2.28 |
| 48 | OPN9 | float64 | 1141 | 0.001637 | 6 | 0.000009 | 4.18217 | 4 | 4 | 5 | 0.0 | 5.000000e+00 | 1.76 |
| 49 | OPN10 | float64 | 1141 | 0.001637 | 6 | 0.000009 | 3.97449 | 4 | 5 | 3 | 0.0 | 5.000000e+00 | 1.91 |
| 50 | EXT1_E | float64 | 1141 | 0.001637 | 70367 | 0.100979 | 100261 | 7322 | 9419 | 7235 | -42958762.0 | 2.147484e+09 | 14.44 |
| 51 | EXT2_E | float64 | 1141 | 0.001637 | 27320 | 0.039205 | 8437.86 | 3434 | 5491 | 3598 | -75632.0 | 2.617734e+08 | 13.02 |
| 52 | EXT3_E | float64 | 1141 | 0.001637 | 29039 | 0.041672 | 9707.34 | 3512 | 3959 | 3315 | -3593866.0 | 6.059057e+08 | 13.02 |
| 53 | EXT4_E | float64 | 1141 | 0.001637 | 31668 | 0.045445 | 7941.57 | 3473 | 4821 | 2564 | -2494907.0 | 1.687112e+08 | 13.09 |
| 54 | EXT5_E | float64 | 1141 | 0.001637 | 26249 | 0.037668 | 7700.61 | 3031 | 5611 | 2976 | -58566.0 | 3.510680e+08 | 12.80 |
| 55 | EXT6_E | float64 | 1141 | 0.001637 | 25316 | 0.036329 | 7045.08 | 3126 | 2756 | 3050 | -79860.0 | 3.164906e+08 | 12.85 |
| 56 | EXT7_E | float64 | 1141 | 0.001637 | 30104 | 0.043200 | 8018.8 | 4340 | 2388 | 4787 | -3594255.0 | 9.635879e+07 | 13.26 |
| 57 | EXT8_E | float64 | 1141 | 0.001637 | 29312 | 0.042064 | 7154.93 | 3617 | 2113 | 3228 | -461138.0 | 2.477062e+08 | 13.09 |
| 58 | EXT9_E | float64 | 1141 | 0.001637 | 27498 | 0.039461 | 6136.92 | 3651 | 5900 | 3465 | -35227.0 | 1.803694e+08 | 13.05 |
| 59 | EXT10_E | float64 | 1141 | 0.001637 | 24916 | 0.035755 | 5305.74 | 3224 | 4110 | 3309 | -142238.0 | 1.502521e+08 | 12.89 |
| 60 | EST1_E | float64 | 1141 | 0.001637 | 29436 | 0.042242 | 8264.63 | 3315 | 6135 | 9036 | -112165.0 | 2.403039e+08 | 13.08 |
| 61 | EST2_E | float64 | 1141 | 0.001637 | 29916 | 0.042931 | 8332.21 | 3603 | 4150 | 2406 | -71572.0 | 1.840717e+08 | 13.10 |
| 62 | EST3_E | float64 | 1141 | 0.001637 | 26044 | 0.037374 | 7350.07 | 2787 | 5739 | 3484 | -24118.0 | 5.250724e+08 | 12.77 |
| 63 | EST4_E | float64 | 1141 | 0.001637 | 42141 | 0.060474 | 10851.9 | 3575 | 6364 | 3359 | -3598047.0 | 8.800429e+08 | 13.31 |
| 64 | EST5_E | float64 | 1141 | 0.001637 | 30183 | 0.043314 | 7451.72 | 3500 | 3663 | 3061 | -88286.0 | 1.947344e+08 | 13.05 |
| 65 | EST6_E | float64 | 1141 | 0.001637 | 28573 | 0.041003 | 7943.11 | 3175 | 5070 | 2539 | -3574100.0 | 3.464129e+08 | 12.93 |
| 66 | EST7_E | float64 | 1141 | 0.001637 | 26708 | 0.038327 | 6540.14 | 3176 | 5709 | 4226 | -2187273.0 | 1.016919e+08 | 12.92 |
| 67 | EST8_E | float64 | 1141 | 0.001637 | 29660 | 0.042563 | 5662.61 | 2932 | 4285 | 2962 | -92455.0 | 2.567383e+08 | 12.86 |
| 68 | EST9_E | float64 | 1141 | 0.001637 | 24971 | 0.035834 | 4969.1 | 2791 | 2587 | 1799 | -79175662.0 | 1.838269e+08 | 12.76 |
| 69 | EST10_E | float64 | 1141 | 0.001637 | 27402 | 0.039323 | 4712.35 | 2569 | 3997 | 1607 | -43558.0 | 8.324175e+07 | 12.73 |
| 70 | AGR1_E | float64 | 1141 | 0.001637 | 39391 | 0.056528 | 17902.4 | 4376 | 4750 | 2158 | -2757521.0 | 1.170859e+09 | 13.61 |
| 71 | AGR2_E | float64 | 1141 | 0.001637 | 28221 | 0.040498 | 8942.01 | 3263 | 5475 | 2090 | -3592606.0 | 4.738983e+08 | 12.97 |
| 72 | AGR3_E | float64 | 1141 | 0.001637 | 28381 | 0.040728 | 6739.65 | 3168 | 11641 | 2143 | -1795552.0 | 1.301244e+08 | 12.93 |
| 73 | AGR4_E | float64 | 1141 | 0.001637 | 30341 | 0.043541 | 8468.81 | 3174 | 3115 | 2807 | -67786.0 | 3.365244e+08 | 12.98 |
| 74 | AGR5_E | float64 | 1141 | 0.001637 | 30785 | 0.044178 | 8746 | 4056 | 3207 | 3422 | -20294.0 | 1.563917e+08 | 13.18 |
| 75 | AGR6_E | float64 | 1141 | 0.001637 | 26392 | 0.037874 | 5877.35 | 2880 | 3260 | 5324 | -247504.0 | 1.018158e+08 | 12.78 |
| 76 | AGR7_E | float64 | 1141 | 0.001637 | 28487 | 0.040880 | 7529.42 | 3683 | 10235 | 4494 | -65423.0 | 2.518615e+08 | 13.05 |
| 77 | AGR8_E | float64 | 1141 | 0.001637 | 31762 | 0.045580 | 9288.41 | 3844 | 5897 | 3627 | -764938.0 | 1.367497e+09 | 13.17 |
| 78 | AGR9_E | float64 | 1141 | 0.001637 | 25205 | 0.036170 | 5197.64 | 3133 | 1758 | 1850 | -527846.0 | 6.275748e+07 | 12.85 |
| 79 | AGR10_E | float64 | 1141 | 0.001637 | 30066 | 0.043146 | 5709.94 | 3334 | 3081 | 1747 | -85674.0 | 8.158242e+07 | 12.97 |
| 80 | CSN1_E | float64 | 1141 | 0.001637 | 30939 | 0.044399 | 13067.6 | 3569 | 6602 | 5163 | -3590638.0 | 7.726592e+08 | 13.17 |
| 81 | CSN2_E | float64 | 1141 | 0.001637 | 34702 | 0.049799 | 10914.2 | 4275 | 5457 | 5240 | -35996486.0 | 2.637374e+08 | 13.35 |
| 82 | CSN3_E | float64 | 1141 | 0.001637 | 27477 | 0.039431 | 9202.25 | 3193 | 1569 | 7208 | -94464.0 | 1.100335e+09 | 12.89 |
| 83 | CSN4_E | float64 | 1141 | 0.001637 | 30055 | 0.043130 | 7964.61 | 3336 | 2129 | 2783 | -50476.0 | 2.690842e+08 | 12.99 |
| 84 | CSN5_E | float64 | 1141 | 0.001637 | 36242 | 0.052009 | 9445.28 | 3585 | 3762 | 4103 | -3512740.0 | 9.586233e+08 | 13.17 |
| 85 | CSN6_E | float64 | 1141 | 0.001637 | 31181 | 0.044746 | 10102.5 | 4359 | 4420 | 3431 | -74245.0 | 4.432097e+08 | 13.28 |
| 86 | CSN7_E | float64 | 1141 | 0.001637 | 24481 | 0.035131 | 5343.7 | 2913 | 9382 | 3347 | -30016.0 | 8.482811e+07 | 12.75 |
| 87 | CSN8_E | float64 | 1141 | 0.001637 | 47792 | 0.068583 | 11106.9 | 3733 | 5286 | 2399 | -177880.0 | 2.503232e+08 | 13.69 |
| 88 | CSN9_E | float64 | 1141 | 0.001637 | 24405 | 0.035022 | 5135.27 | 2915 | 4983 | 3360 | -29167.0 | 8.749788e+07 | 12.76 |
| 89 | CSN10_E | float64 | 1141 | 0.001637 | 40494 | 0.058110 | 9228.05 | 3923 | 6339 | 5595 | -14988.0 | 3.380158e+08 | 13.45 |
| 90 | OPN1_E | float64 | 1141 | 0.001637 | 26670 | 0.038272 | 8881.47 | 3023 | 3146 | 2624 | -53927742.0 | 6.750470e+08 | 12.87 |
| 91 | OPN2_E | float64 | 1141 | 0.001637 | 36984 | 0.053073 | 12545.3 | 4225 | 4067 | 4985 | -215205.0 | 1.026126e+09 | 13.30 |
| 92 | OPN3_E | float64 | 1141 | 0.001637 | 31354 | 0.044994 | 6691.98 | 2738 | 2959 | 1684 | -417031.0 | 1.244837e+08 | 12.86 |
| 93 | OPN4_E | float64 | 1141 | 0.001637 | 34083 | 0.048910 | 8483.66 | 3707 | 3411 | 3026 | -74467.0 | 2.015719e+08 | 13.13 |
| 94 | OPN5_E | float64 | 1141 | 0.001637 | 25447 | 0.036517 | 6140.4 | 2833 | 2170 | 4742 | -75300.0 | 1.626808e+08 | 12.74 |
| 95 | OPN6_E | float64 | 1141 | 0.001637 | 27514 | 0.039484 | 7432.98 | 3320 | 4920 | 3336 | -7125690.0 | 2.435866e+08 | 12.93 |
| 96 | OPN7_E | float64 | 1141 | 0.001637 | 27040 | 0.038803 | 7915.36 | 3195 | 4436 | 2718 | -51694.0 | 3.891434e+08 | 12.86 |
| 97 | OPN8_E | float64 | 1141 | 0.001637 | 24232 | 0.034774 | 5020.32 | 3057 | 3116 | 3374 | -17007.0 | 7.803251e+07 | 12.78 |
| 98 | OPN9_E | float64 | 1141 | 0.001637 | 28092 | 0.040313 | 5712.05 | 3251 | 2992 | 3096 | -95986.0 | 1.138087e+08 | 12.89 |
| 99 | OPN10_E | float64 | 1141 | 0.001637 | 23097 | 0.033145 | 4968.42 | 2193 | 4354 | 3019 | -3594871.0 | 9.048484e+07 | 12.46 |
| 100 | testelapse | float64 | 1141 | 0.001637 | 8741 | 0.012544 | 650.247 | 220 | 234 | 179 | 1.0 | 5.372971e+06 | 8.98 |
| 101 | country | object | 67 | 0.000096 | 221 | 0.000317 | NaN | NaN | GB | MY | NaN | NaN | 3.64 |
100%|████████████████████████████████████████████████████████████████████████████████| 102/102 [07:51<00:00, 4.62s/it]
For the categorical variables we see varying amounts of skewness, usually left-skewed, as option 0 seems to be rarely used in many of them. On the time variables we have some big outliers that make the plots look as if all the data sits at 0. However, there are only a few such outliers and the data is concentrated around more reasonable values; nobody should need 1e8 milliseconds to answer a question.
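A quick way to confirm that the extremes are isolated is to compare upper-tail quantiles against the max (a sketch; the helper name is mine, and any `_E` column can be passed in):

```python
import pandas as pd

def tail_quantiles(s: pd.Series) -> pd.Series:
    """Upper-tail quantiles: if the 99.9th percentile is still modest
    while the max is astronomical, the extremes are isolated outliers."""
    return s.quantile([0.50, 0.90, 0.99, 0.999, 1.00])
```

For example, `tail_quantiles(df['EXT1_E'])` on this dataset should show a median in the low thousands of milliseconds and a max many orders of magnitude larger.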
We have 1,141 rows with nulls, which we can simply drop.
df.dropna(inplace=True)
We also have 221 unique countries; however, the last graph just above this cell already lets us know that the US has a disproportionate share of the data, about half the dataset. We could try to balance it out, but that would halve our dataset, so we'll see what comes out of it as-is.
Let's try to visualize it better
TOP_N = 5
cnt = df['country'].value_counts(normalize=True)[:TOP_N]
cnt = pd.concat([cnt, pd.Series(df['country'].value_counts(normalize=True)[TOP_N:].sum(), index=['Other'])])
print(cnt)
plt.style.use('seaborn-notebook')
cnt.plot(kind='pie',ylabel='')
US       0.496384
GB       0.071539
CA       0.062997
AU       0.049810
DE       0.017708
Other    0.301562
dtype: float64
<AxesSubplot:>
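If we later decide to tame the US imbalance after all, one option is a simple downsample with `sklearn.utils.resample`, which is already imported (a sketch; the helper name and sample size are mine):

```python
import pandas as pd
from sklearn.utils import resample

def downsample_country(df: pd.DataFrame, country: str, n_samples: int, seed: int = 7) -> pd.DataFrame:
    """Downsample one over-represented country to n_samples rows,
    keeping all rows from every other country."""
    majority = df[df['country'] == country]
    rest = df[df['country'] != country]
    reduced = resample(majority, replace=False, n_samples=n_samples, random_state=seed)
    return pd.concat([reduced, rest])
```

For instance, `downsample_country(df, 'US', 100_000)` would bring the US share roughly in line with the rest, at the cost of discarding data.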
From the table we can also see that there are negative time values, which shouldn't be possible. Let's clean those out.
for col in tqdm(df.drop(columns=['country']).columns):
    df = df[df[col] >= 0]
100%|████████████████████████████████████████████████████████████████████████████████| 101/101 [00:37<00:00, 2.66it/s]
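The column-by-column loop above works, but the same cleanup can be done in a single vectorized pass, which avoids rebuilding the DataFrame 101 times (a sketch with a hypothetical helper name):

```python
import pandas as pd

def drop_negative_rows(df: pd.DataFrame) -> pd.DataFrame:
    """Keep only rows where every numeric column is non-negative,
    in one vectorized pass instead of a per-column filter loop."""
    num_cols = df.select_dtypes(include='number').columns
    return df[(df[num_cols] >= 0).all(axis=1)]
```

Non-numeric columns such as `country` are ignored by the mask, so they survive untouched.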
h.resumetable(df,visualize=False)
Dataset Shape: (695225, 102)
| Name | dtypes | Missing | Missing % | Uniques | Uniques % | Mean | Median | First Value | Second Value | Min Value | Max Value | Entropy | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | EXT1 | float64 | 0 | 0.0 | 6 | 0.000009 | 2.5777 | 3 | 4 | 3 | 0 | 5 | 2.23 |
| 1 | EXT2 | float64 | 0 | 0.0 | 6 | 0.000009 | 2.82684 | 3 | 1 | 5 | 0 | 5 | 2.33 |
| 2 | EXT3 | float64 | 0 | 0.0 | 6 | 0.000009 | 3.2219 | 3 | 5 | 3 | 0 | 5 | 2.26 |
| 3 | EXT4 | float64 | 0 | 0.0 | 6 | 0.000009 | 3.19453 | 3 | 2 | 4 | 0 | 5 | 2.27 |
| 4 | EXT5 | float64 | 0 | 0.0 | 6 | 0.000009 | 3.23028 | 3 | 5 | 3 | 0 | 5 | 2.30 |
| 5 | EXT6 | float64 | 0 | 0.0 | 6 | 0.000009 | 2.4142 | 2 | 1 | 3 | 0 | 5 | 2.20 |
| 6 | EXT7 | float64 | 0 | 0.0 | 6 | 0.000009 | 2.70305 | 3 | 5 | 2 | 0 | 5 | 2.34 |
| 7 | EXT8 | float64 | 0 | 0.0 | 6 | 0.000009 | 3.44295 | 4 | 2 | 5 | 0 | 5 | 2.25 |
| 8 | EXT9 | float64 | 0 | 0.0 | 6 | 0.000009 | 2.94016 | 3 | 4 | 1 | 0 | 5 | 2.34 |
| 9 | EXT10 | float64 | 0 | 0.0 | 6 | 0.000009 | 3.59162 | 4 | 1 | 5 | 0 | 5 | 2.21 |
| 10 | EST1 | float64 | 0 | 0.0 | 6 | 0.000009 | 3.2891 | 3 | 1 | 2 | 0 | 5 | 2.31 |
| 11 | EST2 | float64 | 0 | 0.0 | 6 | 0.000009 | 3.14004 | 3 | 4 | 3 | 0 | 5 | 2.28 |
| 12 | EST3 | float64 | 0 | 0.0 | 6 | 0.000009 | 3.85976 | 4 | 4 | 4 | 0 | 5 | 2.02 |
| 13 | EST4 | float64 | 0 | 0.0 | 6 | 0.000009 | 2.63757 | 3 | 2 | 1 | 0 | 5 | 2.29 |
| 14 | EST5 | float64 | 0 | 0.0 | 6 | 0.000009 | 2.84748 | 3 | 2 | 3 | 0 | 5 | 2.30 |
| 15 | EST6 | float64 | 0 | 0.0 | 6 | 0.000009 | 2.84613 | 3 | 2 | 1 | 0 | 5 | 2.34 |
| 16 | EST7 | float64 | 0 | 0.0 | 6 | 0.000009 | 3.0515 | 3 | 2 | 2 | 0 | 5 | 2.31 |
| 17 | EST8 | float64 | 0 | 0.0 | 6 | 0.000009 | 2.69082 | 3 | 2 | 1 | 0 | 5 | 2.32 |
| 18 | EST9 | float64 | 0 | 0.0 | 6 | 0.000009 | 3.08781 | 3 | 3 | 3 | 0 | 5 | 2.30 |
| 19 | EST10 | float64 | 0 | 0.0 | 6 | 0.000009 | 2.83712 | 3 | 2 | 1 | 0 | 5 | 2.33 |
| 20 | AGR1 | float64 | 0 | 0.0 | 6 | 0.000009 | 2.23959 | 2 | 2 | 1 | 0 | 5 | 2.13 |
| 21 | AGR2 | float64 | 0 | 0.0 | 6 | 0.000009 | 3.82091 | 4 | 5 | 4 | 0 | 5 | 2.05 |
| 22 | AGR3 | float64 | 0 | 0.0 | 6 | 0.000009 | 2.25861 | 2 | 2 | 1 | 0 | 5 | 2.14 |
| 23 | AGR4 | float64 | 0 | 0.0 | 6 | 0.000009 | 3.92139 | 4 | 4 | 5 | 0 | 5 | 1.98 |
| 24 | AGR5 | float64 | 0 | 0.0 | 6 | 0.000009 | 2.28908 | 2 | 2 | 1 | 0 | 5 | 2.13 |
| 25 | AGR6 | float64 | 0 | 0.0 | 6 | 0.000009 | 3.72551 | 4 | 3 | 5 | 0 | 5 | 2.14 |
| 26 | AGR7 | float64 | 0 | 0.0 | 6 | 0.000009 | 2.22175 | 2 | 2 | 3 | 0 | 5 | 2.08 |
| 27 | AGR8 | float64 | 0 | 0.0 | 6 | 0.000009 | 3.65712 | 4 | 4 | 4 | 0 | 5 | 2.07 |
| 28 | AGR9 | float64 | 0 | 0.0 | 6 | 0.000009 | 3.77656 | 4 | 3 | 5 | 0 | 5 | 2.07 |
| 29 | AGR10 | float64 | 0 | 0.0 | 6 | 0.000009 | 3.57561 | 4 | 4 | 3 | 0 | 5 | 2.08 |
| 30 | CSN1 | float64 | 0 | 0.0 | 6 | 0.000009 | 3.2809 | 3 | 3 | 3 | 0 | 5 | 2.21 |
| 31 | CSN2 | float64 | 0 | 0.0 | 6 | 0.000009 | 2.97944 | 3 | 4 | 2 | 0 | 5 | 2.35 |
| 32 | CSN3 | float64 | 0 | 0.0 | 6 | 0.000009 | 3.98043 | 4 | 3 | 5 | 0 | 5 | 1.91 |
| 33 | CSN4 | float64 | 0 | 0.0 | 6 | 0.000009 | 2.6421 | 3 | 2 | 3 | 0 | 5 | 2.28 |
| 34 | CSN5 | float64 | 0 | 0.0 | 6 | 0.000009 | 2.57754 | 2 | 2 | 3 | 0 | 5 | 2.28 |
| 35 | CSN6 | float64 | 0 | 0.0 | 6 | 0.000009 | 2.85137 | 3 | 4 | 1 | 0 | 5 | 2.34 |
| 36 | CSN7 | float64 | 0 | 0.0 | 6 | 0.000009 | 3.69986 | 4 | 4 | 3 | 0 | 5 | 2.08 |
| 37 | CSN8 | float64 | 0 | 0.0 | 6 | 0.000009 | 2.47637 | 2 | 2 | 3 | 0 | 5 | 2.18 |
| 38 | CSN9 | float64 | 0 | 0.0 | 6 | 0.000009 | 3.13912 | 3 | 4 | 5 | 0 | 5 | 2.30 |
| 39 | CSN10 | float64 | 0 | 0.0 | 6 | 0.000009 | 3.59556 | 4 | 4 | 3 | 0 | 5 | 2.04 |
| 40 | OPN1 | float64 | 0 | 0.0 | 6 | 0.000009 | 3.7346 | 4 | 5 | 1 | 0 | 5 | 2.08 |
| 41 | OPN2 | float64 | 0 | 0.0 | 6 | 0.000009 | 2.02635 | 2 | 1 | 2 | 0 | 5 | 1.99 |
| 42 | OPN3 | float64 | 0 | 0.0 | 6 | 0.000009 | 4.03509 | 4 | 4 | 4 | 0 | 5 | 1.91 |
| 43 | OPN4 | float64 | 0 | 0.0 | 6 | 0.000009 | 1.95589 | 2 | 1 | 2 | 0 | 5 | 1.94 |
| 44 | OPN5 | float64 | 0 | 0.0 | 6 | 0.000009 | 3.8058 | 4 | 4 | 3 | 0 | 5 | 1.92 |
| 45 | OPN6 | float64 | 0 | 0.0 | 6 | 0.000009 | 1.8713 | 2 | 1 | 1 | 0 | 5 | 1.88 |
| 46 | OPN7 | float64 | 0 | 0.0 | 6 | 0.000009 | 4.01746 | 4 | 5 | 4 | 0 | 5 | 1.85 |
| 47 | OPN8 | float64 | 0 | 0.0 | 6 | 0.000009 | 3.24783 | 3 | 3 | 2 | 0 | 5 | 2.28 |
| 48 | OPN9 | float64 | 0 | 0.0 | 6 | 0.000009 | 4.18219 | 4 | 4 | 5 | 0 | 5 | 1.76 |
| 49 | OPN10 | float64 | 0 | 0.0 | 6 | 0.000009 | 3.97446 | 4 | 5 | 3 | 0 | 5 | 1.91 |
| 50 | EXT1_E | float64 | 0 | 0.0 | 70313 | 0.101137 | 100245 | 7323 | 9419 | 7235 | 0 | 2.14748e+09 | 14.44 |
| 51 | EXT2_E | float64 | 0 | 0.0 | 27306 | 0.039276 | 8440.55 | 3434 | 5491 | 3598 | 0 | 2.61773e+08 | 13.02 |
| 52 | EXT3_E | float64 | 0 | 0.0 | 29023 | 0.041746 | 9715.5 | 3512 | 3959 | 3315 | 0 | 6.05906e+08 | 13.02 |
| 53 | EXT4_E | float64 | 0 | 0.0 | 31640 | 0.045510 | 7946.64 | 3473 | 4821 | 2564 | 0 | 1.68711e+08 | 13.09 |
| 54 | EXT5_E | float64 | 0 | 0.0 | 26225 | 0.037722 | 7700.84 | 3032 | 5611 | 2976 | 0 | 3.51068e+08 | 12.80 |
| 55 | EXT6_E | float64 | 0 | 0.0 | 25298 | 0.036388 | 7046.89 | 3126 | 2756 | 3050 | 0 | 3.16491e+08 | 12.85 |
| 56 | EXT7_E | float64 | 0 | 0.0 | 30083 | 0.043271 | 8023.23 | 4339 | 2388 | 4787 | 0 | 9.63588e+07 | 13.25 |
| 57 | EXT8_E | float64 | 0 | 0.0 | 29291 | 0.042132 | 7156.65 | 3616 | 2113 | 3228 | 0 | 2.47706e+08 | 13.09 |
| 58 | EXT9_E | float64 | 0 | 0.0 | 27483 | 0.039531 | 6137.38 | 3651 | 5900 | 3465 | 0 | 1.80369e+08 | 13.05 |
| 59 | EXT10_E | float64 | 0 | 0.0 | 24898 | 0.035813 | 5306.65 | 3224 | 4110 | 3309 | 0 | 1.50252e+08 | 12.88 |
| 60 | EST1_E | float64 | 0 | 0.0 | 29421 | 0.042319 | 8267.4 | 3315 | 6135 | 9036 | 0 | 2.40304e+08 | 13.08 |
| 61 | EST2_E | float64 | 0 | 0.0 | 29892 | 0.042996 | 8333.54 | 3602 | 4150 | 2406 | 0 | 1.84072e+08 | 13.09 |
| 62 | EST3_E | float64 | 0 | 0.0 | 26030 | 0.037441 | 7352.24 | 2787 | 5739 | 3484 | 0 | 5.25072e+08 | 12.77 |
| 63 | EST4_E | float64 | 0 | 0.0 | 42116 | 0.060579 | 10859.1 | 3575 | 6364 | 3359 | 0 | 8.80043e+08 | 13.31 |
| 64 | EST5_E | float64 | 0 | 0.0 | 30163 | 0.043386 | 7453.64 | 3500 | 3663 | 3061 | 0 | 1.94734e+08 | 13.05 |
| 65 | EST6_E | float64 | 0 | 0.0 | 28561 | 0.041082 | 7950.72 | 3175 | 5070 | 2539 | 0 | 3.46413e+08 | 12.93 |
| 66 | EST7_E | float64 | 0 | 0.0 | 26686 | 0.038385 | 6543.86 | 3176 | 5709 | 4226 | 0 | 1.01692e+08 | 12.92 |
| 67 | EST8_E | float64 | 0 | 0.0 | 29639 | 0.042632 | 5663.89 | 2932 | 4285 | 2962 | 0 | 2.56738e+08 | 12.86 |
| 68 | EST9_E | float64 | 0 | 0.0 | 24948 | 0.035885 | 5083.97 | 2791 | 2587 | 1799 | 0 | 1.83827e+08 | 12.76 |
| 69 | EST10_E | float64 | 0 | 0.0 | 27390 | 0.039397 | 4712.96 | 2569 | 3997 | 1607 | 0 | 8.32418e+07 | 12.73 |
| 70 | AGR1_E | float64 | 0 | 0.0 | 39360 | 0.056615 | 17899.3 | 4376 | 4750 | 2158 | 0 | 1.17086e+09 | 13.61 |
| 71 | AGR2_E | float64 | 0 | 0.0 | 28202 | 0.040565 | 8950.46 | 3263 | 5475 | 2090 | 0 | 4.73898e+08 | 12.97 |
| 72 | AGR3_E | float64 | 0 | 0.0 | 28363 | 0.040797 | 6742.53 | 3168 | 11641 | 2143 | 0 | 1.30124e+08 | 12.93 |
| 73 | AGR4_E | float64 | 0 | 0.0 | 30313 | 0.043602 | 8471 | 3174 | 3115 | 2807 | 0 | 3.36524e+08 | 12.98 |
| 74 | AGR5_E | float64 | 0 | 0.0 | 30764 | 0.044250 | 8746.91 | 4056 | 3207 | 3422 | 0 | 1.56392e+08 | 13.18 |
| 75 | AGR6_E | float64 | 0 | 0.0 | 26377 | 0.037940 | 5878.7 | 2880 | 3260 | 5324 | 0 | 1.01816e+08 | 12.78 |
| 76 | AGR7_E | float64 | 0 | 0.0 | 28472 | 0.040954 | 7531.02 | 3683 | 10235 | 4494 | 0 | 2.51861e+08 | 13.05 |
| 77 | AGR8_E | float64 | 0 | 0.0 | 31739 | 0.045653 | 9292.07 | 3844 | 5897 | 3627 | 0 | 1.3675e+09 | 13.17 |
| 78 | AGR9_E | float64 | 0 | 0.0 | 25186 | 0.036227 | 5198.95 | 3133 | 1758 | 1850 | 0 | 6.27575e+07 | 12.85 |
| 79 | AGR10_E | float64 | 0 | 0.0 | 30053 | 0.043228 | 5711.06 | 3334 | 3081 | 1747 | 0 | 8.15824e+07 | 12.97 |
| 80 | CSN1_E | float64 | 0 | 0.0 | 30923 | 0.044479 | 13083.1 | 3569 | 6602 | 5163 | 0 | 7.72659e+08 | 13.17 |
| 81 | CSN2_E | float64 | 0 | 0.0 | 34681 | 0.049885 | 10969.1 | 4274 | 5457 | 5240 | 0 | 2.63737e+08 | 13.35 |
| 82 | CSN3_E | float64 | 0 | 0.0 | 27458 | 0.039495 | 9205.65 | 3193 | 1569 | 7208 | 0 | 1.10033e+09 | 12.89 |
| 83 | CSN4_E | float64 | 0 | 0.0 | 30033 | 0.043199 | 7958.02 | 3336 | 2129 | 2783 | 0 | 2.69084e+08 | 12.99 |
| 84 | CSN5_E | float64 | 0 | 0.0 | 36227 | 0.052108 | 9452.12 | 3584 | 3762 | 4103 | 0 | 9.58623e+08 | 13.17 |
| 85 | CSN6_E | float64 | 0 | 0.0 | 31161 | 0.044821 | 10104.1 | 4359 | 4420 | 3431 | 0 | 4.4321e+08 | 13.28 |
| 86 | CSN7_E | float64 | 0 | 0.0 | 24468 | 0.035194 | 5344.83 | 2913 | 9382 | 3347 | 0 | 8.48281e+07 | 12.75 |
| 87 | CSN8_E | float64 | 0 | 0.0 | 47771 | 0.068713 | 11109.6 | 3733 | 5286 | 2399 | 0 | 2.50323e+08 | 13.69 |
| 88 | CSN9_E | float64 | 0 | 0.0 | 24389 | 0.035081 | 5135.82 | 2915 | 4983 | 3360 | 0 | 8.74979e+07 | 12.76 |
| 89 | CSN10_E | float64 | 0 | 0.0 | 40470 | 0.058211 | 9229.25 | 3922 | 6339 | 5595 | 0 | 3.38016e+08 | 13.44 |
| 90 | OPN1_E | float64 | 0 | 0.0 | 26653 | 0.038337 | 8962.5 | 3023 | 3146 | 2624 | 0 | 6.75047e+08 | 12.87 |
| 91 | OPN2_E | float64 | 0 | 0.0 | 36963 | 0.053167 | 12550.2 | 4225 | 4067 | 4985 | 0 | 1.02613e+09 | 13.30 |
| 92 | OPN3_E | float64 | 0 | 0.0 | 31331 | 0.045066 | 6693.28 | 2738 | 2959 | 1684 | 0 | 1.24484e+08 | 12.86 |
| 93 | OPN4_E | float64 | 0 | 0.0 | 34061 | 0.048993 | 8482.78 | 3707 | 3411 | 3026 | 0 | 2.01572e+08 | 13.13 |
| 94 | OPN5_E | float64 | 0 | 0.0 | 25434 | 0.036584 | 6132.35 | 2833 | 2170 | 4742 | 0 | 1.62681e+08 | 12.74 |
| 95 | OPN6_E | float64 | 0 | 0.0 | 27489 | 0.039540 | 7445.71 | 3320 | 4920 | 3336 | 0 | 2.43587e+08 | 12.93 |
| 96 | OPN7_E | float64 | 0 | 0.0 | 27023 | 0.038869 | 7917.11 | 3195 | 4436 | 2718 | 0 | 3.89143e+08 | 12.86 |
| 97 | OPN8_E | float64 | 0 | 0.0 | 24220 | 0.034838 | 5020.94 | 3057 | 3116 | 3374 | 0 | 7.80325e+07 | 12.78 |
| 98 | OPN9_E | float64 | 0 | 0.0 | 28075 | 0.040383 | 5713.16 | 3251 | 2992 | 3096 | 0 | 1.13809e+08 | 12.89 |
| 99 | OPN10_E | float64 | 0 | 0.0 | 23083 | 0.033202 | 4975 | 2193 | 4354 | 3019 | 0 | 9.04848e+07 | 12.46 |
| 100 | testelapse | float64 | 0 | 0.0 | 8735 | 0.012564 | 650.277 | 220 | 234 | 179 | 1 | 5.37297e+06 | 8.98 |
| 101 | country | object | 0 | 0.0 | 221 | 0.000318 | NaN | NaN | GB | MY | AD | ZW | 3.64 |
# We have some big outliers on the time variables, so we'll robust-scale those and standard-scale the rest
# set the structure: start from one-hot encoded country columns
df_all_scaled = pd.get_dummies(df['country'].to_frame())
# go through each numeric column and scale it
for col in tqdm(df.columns.values):
    if df[col].dtype != 'object':
        if col.endswith('_E') or col.endswith('e'):  # elapsed-time columns
            df_all_scaled[col] = RobustScaler().fit_transform(df[col].to_numpy().reshape(-1, 1))
        else:
            df_all_scaled[col] = StandardScaler().fit_transform(df[col].to_numpy().reshape(-1, 1))
100%|████████████████████████████████████████████████████████████████████████████████| 102/102 [00:03<00:00, 31.56it/s]
# let's check the shape and make sure we didn't introduce any errors like nulls, and take a look at the robust-scaled time features
display(df_all_scaled.shape)
display(df_all_scaled.isna().sum().sum())
(695225, 322)
0
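To see why the time columns get `RobustScaler` while the answer columns get `StandardScaler`, here's a minimal sketch on a toy column with one extreme outlier (the values are made up for illustration):

```python
import numpy as np
from sklearn.preprocessing import RobustScaler, StandardScaler

# A toy column with one extreme outlier, like the *_E elapsed-time features
x = np.array([1.0, 2.0, 3.0, 4.0, 1000.0]).reshape(-1, 1)

# StandardScaler centers on the mean and scales by the standard deviation,
# both of which the outlier drags around; RobustScaler centers on the
# median and scales by the IQR, which the outlier barely touches
std = StandardScaler().fit_transform(x)
rob = RobustScaler().fit_transform(x)

print(std.ravel().round(2))  # ordinary values all squashed together below zero
print(rob.ravel().round(2))  # ordinary values spread evenly around zero
```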
df_lbl_enc_scaled = pd.DataFrame(LabelEncoder().fit_transform(df['country']), columns=['country'])
for col in tqdm(df.columns.values):
    if df[col].dtype != 'object':
        if col.endswith('_E') or col.endswith('e'):
            df_lbl_enc_scaled[col] = RobustScaler().fit_transform(df[col].to_numpy().reshape(-1, 1))
        else:
            df_lbl_enc_scaled[col] = StandardScaler().fit_transform(df[col].to_numpy().reshape(-1, 1))
display(df_lbl_enc_scaled.shape)
display(df_lbl_enc_scaled.isna().sum().sum())
100%|████████████████████████████████████████████████████████████████████████████████| 102/102 [00:03<00:00, 31.81it/s]
(695225, 102)
0
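The difference between the two country encodings in these cells: `get_dummies` expands country into one 0/1 column per value (hence the 322 columns above), while `LabelEncoder` packs it into a single integer column (hence 102). A small sketch with made-up values:

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# A tiny stand-in for the country column
s = pd.Series(['US', 'GB', 'US', 'PH'], name='country')

one_hot = pd.get_dummies(s.to_frame())  # one 0/1 column per country
lbl = LabelEncoder().fit_transform(s)   # one integer code per row, classes sorted alphabetically

print(one_hot.columns.tolist())  # ['country_GB', 'country_PH', 'country_US']
print(lbl)                       # [2 0 2 1]
```

Note that the integer codes impose an arbitrary ordering on countries, which distance-based clustering will treat as meaningful.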
# first the imputations
time_cols = [x for x in df.columns.values if x.endswith('_E') or x.endswith('e')]
q75 = df[time_cols].quantile(.75)
med = df[time_cols].median()
imputed = pd.DataFrame()
for col in time_cols:
    imputed[col] = np.where(df[col] > q75[col], med[col], df[col])
display(imputed.shape)
display(imputed.isna().sum().sum())
(695225, 51)
0
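The imputation rule above replaces any time value beyond that column's 75th percentile with the column median. On a toy column:

```python
import numpy as np
import pandas as pd

# Toy elapsed-time column with one big outlier
col = pd.Series([10.0, 12.0, 11.0, 13.0, 500.0])
q75 = col.quantile(.75)  # 13.0
med = col.median()       # 12.0

# Same np.where rule as above: cap everything beyond the 75th percentile at the median
imputed = np.where(col > q75, med, col)
print(imputed)  # [10. 12. 11. 13. 12.]
```

Note this is a fairly aggressive rule: up to a quarter of each column (everything above q75) gets flattened to the median, not just the extreme outliers.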
df_not_all_scaled = pd.DataFrame(LabelEncoder().fit_transform(df['country']), columns=['country'])
for col in tqdm(df.columns.values):
    if df[col].dtype != 'object':
        if col.endswith('_E') or col.endswith('e'):
            df_not_all_scaled[col] = RobustScaler().fit_transform(imputed[col].to_numpy().reshape(-1, 1))
        else:
            df_not_all_scaled[col] = df[col].to_numpy()  # leave the answer columns unscaled
display(df_not_all_scaled.shape)
display(df_not_all_scaled.isna().sum().sum())
h.resumetable(df_not_all_scaled[time_cols])
100%|████████████████████████████████████████████████████████████████████████████████| 102/102 [00:02<00:00, 44.25it/s]
(695225, 102)
0
Dataset Shape: (695225, 51)
| Name | dtypes | Missing | Missing % | Uniques | Uniques % | Mean | Median | First Value | Second Value | Min Value | Max Value | Entropy | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | EXT1_E | float64 | 0 | 0.0 | 11484 | 0.016518 | -0.325244 | 0.0 | 0.839407 | -0.035242 | -2.932719 | 1.840208 | 10.69 |
| 1 | EXT2_E | float64 | 0 | 0.0 | 4839 | 0.006960 | -0.372262 | 0.0 | 0.000000 | 0.158607 | -3.321083 | 1.594778 | 9.73 |
| 2 | EXT3_E | float64 | 0 | 0.0 | 4914 | 0.007068 | -0.370333 | 0.0 | 0.436098 | -0.192195 | -3.426341 | 1.609756 | 9.72 |
| 3 | EXT4_E | float64 | 0 | 0.0 | 4967 | 0.007144 | -0.367894 | 0.0 | 1.264540 | -0.852720 | -3.257974 | 1.618199 | 9.76 |
| 4 | EXT5_E | float64 | 0 | 0.0 | 4265 | 0.006135 | -0.365339 | 0.0 | 0.000000 | -0.063063 | -3.414414 | 1.665541 | 9.54 |
| 5 | EXT6_E | float64 | 0 | 0.0 | 4387 | 0.006310 | -0.366777 | 0.0 | -0.396146 | -0.081370 | -3.346895 | 1.597430 | 9.61 |
| 6 | EXT7_E | float64 | 0 | 0.0 | 6006 | 0.008639 | -0.388224 | 0.0 | -1.518288 | 0.348638 | -3.376654 | 1.484825 | 9.95 |
| 7 | EXT8_E | float64 | 0 | 0.0 | 5124 | 0.007370 | -0.369155 | 0.0 | -1.385253 | -0.357604 | -3.332719 | 1.606452 | 9.79 |
| 8 | EXT9_E | float64 | 0 | 0.0 | 5109 | 0.007349 | -0.375779 | 0.0 | 0.000000 | -0.176471 | -3.463947 | 1.611954 | 9.77 |
| 9 | EXT10_E | float64 | 0 | 0.0 | 4491 | 0.006460 | -0.376442 | 0.0 | 0.919087 | 0.088174 | -3.344398 | 1.543568 | 9.65 |
| 10 | EST1_E | float64 | 0 | 0.0 | 4889 | 0.007032 | -0.354286 | 0.0 | 0.000000 | 0.000000 | -3.055300 | 1.670968 | 9.75 |
| 11 | EST2_E | float64 | 0 | 0.0 | 5151 | 0.007409 | -0.370035 | 0.0 | 0.492363 | -1.074573 | -3.236298 | 1.607367 | 9.77 |
| 12 | EST3_E | float64 | 0 | 0.0 | 3990 | 0.005739 | -0.356719 | 0.0 | 0.000000 | 0.816159 | -3.263466 | 1.663934 | 9.52 |
| 13 | EST4_E | float64 | 0 | 0.0 | 5380 | 0.007739 | -0.353125 | 0.0 | 0.000000 | -0.192000 | -3.177778 | 1.808889 | 9.81 |
| 14 | EST5_E | float64 | 0 | 0.0 | 4936 | 0.007100 | -0.366418 | 0.0 | 0.157640 | -0.424565 | -3.384913 | 1.626692 | 9.74 |
| 15 | EST6_E | float64 | 0 | 0.0 | 4585 | 0.006595 | -0.361394 | 0.0 | 0.000000 | -0.654995 | -3.269825 | 1.685891 | 9.63 |
| 16 | EST7_E | float64 | 0 | 0.0 | 4504 | 0.006478 | -0.365225 | 0.0 | 0.000000 | 1.084711 | -3.280992 | 1.608471 | 9.65 |
| 17 | EST8_E | float64 | 0 | 0.0 | 4194 | 0.006033 | -0.360998 | 0.0 | 1.571429 | 0.034843 | -3.405343 | 1.727062 | 9.55 |
| 18 | EST9_E | float64 | 0 | 0.0 | 3988 | 0.005736 | -0.354944 | 0.0 | -0.239156 | -1.162954 | -3.271981 | 1.665885 | 9.52 |
| 19 | EST10_E | float64 | 0 | 0.0 | 3749 | 0.005392 | -0.345658 | 0.0 | 0.000000 | -1.196517 | -3.195274 | 1.730100 | 9.44 |
| 20 | AGR1_E | float64 | 0 | 0.0 | 6597 | 0.009489 | -0.364847 | 0.0 | 0.255814 | -1.517100 | -2.993160 | 1.687415 | 10.14 |
| 21 | AGR2_E | float64 | 0 | 0.0 | 4689 | 0.006745 | -0.361778 | 0.0 | 0.000000 | -1.159091 | -3.224308 | 1.645257 | 9.67 |
| 22 | AGR3_E | float64 | 0 | 0.0 | 4480 | 0.006444 | -0.362733 | 0.0 | 0.000000 | -1.099785 | -3.399142 | 1.672747 | 9.63 |
| 23 | AGR4_E | float64 | 0 | 0.0 | 4601 | 0.006618 | -0.356156 | 0.0 | -0.060327 | -0.375256 | -3.245399 | 1.696319 | 9.66 |
| 24 | AGR5_E | float64 | 0 | 0.0 | 5617 | 0.008079 | -0.380758 | 0.0 | -0.748677 | -0.559083 | -3.576720 | 1.598765 | 9.86 |
| 25 | AGR6_E | float64 | 0 | 0.0 | 4113 | 0.005916 | -0.361704 | 0.0 | 0.439815 | 0.000000 | -3.333333 | 1.696759 | 9.51 |
| 26 | AGR7_E | float64 | 0 | 0.0 | 5100 | 0.007336 | -0.379039 | 0.0 | 0.000000 | 0.785092 | -3.565344 | 1.599226 | 9.75 |
| 27 | AGR8_E | float64 | 0 | 0.0 | 5445 | 0.007832 | -0.372206 | 0.0 | 0.000000 | -0.187069 | -3.313793 | 1.587069 | 9.84 |
| 28 | AGR9_E | float64 | 0 | 0.0 | 4397 | 0.006325 | -0.364632 | 0.0 | -1.506024 | -1.405257 | -3.431544 | 1.642935 | 9.61 |
| 29 | AGR10_E | float64 | 0 | 0.0 | 4640 | 0.006674 | -0.378221 | 0.0 | -0.267725 | -1.679365 | -3.528042 | 1.632804 | 9.65 |
| 30 | CSN1_E | float64 | 0 | 0.0 | 5273 | 0.007585 | -0.361782 | 0.0 | 0.000000 | 1.370593 | -3.068788 | 1.676698 | 9.81 |
| 31 | CSN2_E | float64 | 0 | 0.0 | 6029 | 0.008672 | -0.383626 | 0.0 | 0.917766 | 0.749418 | -3.315749 | 1.553918 | 9.97 |
| 32 | CSN3_E | float64 | 0 | 0.0 | 4465 | 0.006422 | -0.364557 | 0.0 | -1.755676 | 0.000000 | -3.451892 | 1.643243 | 9.62 |
| 33 | CSN4_E | float64 | 0 | 0.0 | 4693 | 0.006750 | -0.369237 | 0.0 | -1.216734 | -0.557460 | -3.362903 | 1.611895 | 9.68 |
| 34 | CSN5_E | float64 | 0 | 0.0 | 5167 | 0.007432 | -0.362599 | 0.0 | 0.168880 | 0.492410 | -3.400380 | 1.725806 | 9.77 |
| 35 | CSN6_E | float64 | 0 | 0.0 | 5983 | 0.008606 | -0.398443 | 0.0 | 0.048644 | -0.740032 | -3.476077 | 1.485646 | 9.96 |
| 36 | CSN7_E | float64 | 0 | 0.0 | 4088 | 0.005880 | -0.359631 | 0.0 | 0.000000 | 0.503480 | -3.379350 | 1.641531 | 9.52 |
| 37 | CSN8_E | float64 | 0 | 0.0 | 6657 | 0.009575 | -0.296346 | 0.0 | 1.215180 | -1.043818 | -2.920970 | 2.460876 | 10.02 |
| 38 | CSN9_E | float64 | 0 | 0.0 | 4095 | 0.005890 | -0.368232 | 0.0 | 0.000000 | 0.505682 | -3.312500 | 1.587500 | 9.54 |
| 39 | CSN10_E | float64 | 0 | 0.0 | 5830 | 0.008386 | -0.361019 | 0.0 | 0.000000 | 1.328832 | -3.115171 | 1.695790 | 9.94 |
| 40 | OPN1_E | float64 | 0 | 0.0 | 4315 | 0.006207 | -0.369087 | 0.0 | 0.129747 | -0.420886 | -3.188819 | 1.603376 | 9.60 |
| 41 | OPN2_E | float64 | 0 | 0.0 | 5871 | 0.008445 | -0.384133 | 0.0 | -0.134583 | 0.647359 | -3.598807 | 1.615843 | 9.91 |
| 42 | OPN3_E | float64 | 0 | 0.0 | 4025 | 0.005789 | -0.352065 | 0.0 | 0.251995 | -1.201824 | -3.122007 | 1.701254 | 9.53 |
| 43 | OPN4_E | float64 | 0 | 0.0 | 5195 | 0.007472 | -0.377924 | 0.0 | -0.285990 | -0.657971 | -3.581643 | 1.661836 | 9.77 |
| 44 | OPN5_E | float64 | 0 | 0.0 | 4014 | 0.005774 | -0.358547 | 0.0 | -0.779083 | 0.000000 | -3.329025 | 1.655699 | 9.50 |
| 45 | OPN6_E | float64 | 0 | 0.0 | 4653 | 0.006693 | -0.373083 | 0.0 | 0.000000 | 0.016667 | -3.458333 | 1.633333 | 9.66 |
| 46 | OPN7_E | float64 | 0 | 0.0 | 4436 | 0.006381 | -0.376083 | 0.0 | 1.353326 | -0.520174 | -3.484188 | 1.619411 | 9.61 |
| 47 | OPN8_E | float64 | 0 | 0.0 | 4232 | 0.006087 | -0.374869 | 0.0 | 0.065193 | 0.350276 | -3.377901 | 1.550276 | 9.57 |
| 48 | OPN9_E | float64 | 0 | 0.0 | 4497 | 0.006468 | -0.380296 | 0.0 | -0.280607 | -0.167931 | -3.522210 | 1.594800 | 9.62 |
| 49 | OPN10_E | float64 | 0 | 0.0 | 3143 | 0.004521 | -0.344548 | 0.0 | 0.000000 | 1.165021 | -3.093089 | 1.622003 | 9.27 |
| 50 | testelapse | float64 | 0 | 0.0 | 306 | 0.000440 | -0.339571 | 0.0 | 0.274510 | -0.803922 | -4.294118 | 1.686275 | 6.56 |
100%|██████████████████████████████████████████████████████████████████████████████████| 51/51 [02:41<00:00, 3.17s/it]
Well, that doesn't look great, but there may not be a good way to handle this many outliers on the time variables. Each column has plenty of high outliers, and removing all those rows would eliminate too much of our dataset.
df_simple = df.drop(columns=['country'] + time_cols)
df_simple['country'] = LabelEncoder().fit_transform(df['country'])
display(df_simple.shape)
display(df_simple.isna().sum().sum())
(695225, 51)
0
We'll start modeling now. I wanted to try a bunch of unsupervised clustering models (MeanShift, AffinityPropagation, AgglomerativeClustering, SpectralClustering, DBSCAN, OPTICS, Birch) but sadly my computer does not have the necessary specs to try more than KMeans with this many observations.
We'll check three performance measures: the sum of squared errors (SSE), the Davies-Bouldin score, and the Calinski-Harabasz score.
# k-means args
kmeans_kwargs = {
    "init": "k-means++",
    "n_init": 10,
    "max_iter": 100,
    "random_state": SEED,
}
sse = {'df_all_scaled': [], 'df_lbl_enc_scaled': [], 'df_not_all_scaled': [], 'df_simple': []}  # SSE values for each k and each dataset
dbs = {'df_all_scaled': [], 'df_lbl_enc_scaled': [], 'df_not_all_scaled': [], 'df_simple': []}  # Davies-Bouldin scores for each k and each dataset
chs = {'df_all_scaled': [], 'df_lbl_enc_scaled': [], 'df_not_all_scaled': [], 'df_simple': []}  # Calinski-Harabasz scores for each k and each dataset
rng = range(2, 18)  # the range of cluster counts to try
df_list = [df_all_scaled, df_lbl_enc_scaled, df_not_all_scaled, df_simple]  # datasets to iterate through
df_names_list = ['df_all_scaled', 'df_lbl_enc_scaled', 'df_not_all_scaled', 'df_simple']  # matching dataset names
for i in range(len(df_list)):
    for k in tqdm(rng):
        clusterer = MiniBatchKMeans(n_clusters=k, **kmeans_kwargs)
        clusterer.fit(df_list[i])
        labels = clusterer.labels_
        sse[df_names_list[i]].append(clusterer.inertia_)
        dbs[df_names_list[i]].append(davies_bouldin_score(df_list[i], labels))
        chs[df_names_list[i]].append(calinski_harabasz_score(df_list[i], labels))
    print('{} done'.format(i))
100%|██████████████████████████████████████████████████████████████████████████████████| 16/16 [03:28<00:00, 13.03s/it]
0 done
100%|██████████████████████████████████████████████████████████████████████████████████| 16/16 [01:24<00:00,  5.29s/it]
1 done
100%|██████████████████████████████████████████████████████████████████████████████████| 16/16 [01:39<00:00,  6.22s/it]
2 done
100%|██████████████████████████████████████████████████████████████████████████████████| 16/16 [01:02<00:00, 3.88s/it]
3 done
plt.style.use('seaborn-notebook')
display(pd.DataFrame(sse,index=rng).plot(xlabel='Number of Clusters',ylabel='SSE'))
# we had to separate this one because the scale is so different we couldn't see its values!
display(pd.DataFrame(sse,index=rng)[['df_simple']].plot(xlabel='Number of Clusters',ylabel='SSE'))
<AxesSubplot:xlabel='Number of Clusters', ylabel='SSE'>
<AxesSubplot:xlabel='Number of Clusters', ylabel='SSE'>
This index signifies the average 'similarity' between clusters, where similarity compares the distance between clusters with the size of the clusters themselves. Zero is the lowest possible score; values closer to zero indicate a better partition.
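As a quick sanity check on how the Davies-Bouldin score behaves, here's a toy sketch comparing tight, well-separated blobs against overlapping ones (synthetic data, for intuition only):

```python
import numpy as np
from sklearn.metrics import davies_bouldin_score

rnd = np.random.default_rng(0)
labels = np.array([0] * 50 + [1] * 50)

# Two tight blobs far apart vs. two wide blobs almost on top of each other
tight = np.vstack([rnd.normal(0, 0.1, (50, 2)), rnd.normal(5, 0.1, (50, 2))])
loose = np.vstack([rnd.normal(0, 2.0, (50, 2)), rnd.normal(1, 2.0, (50, 2))])

dbs_tight = davies_bouldin_score(tight, labels)
dbs_loose = davies_bouldin_score(loose, labels)
print(dbs_tight, dbs_loose)  # near zero vs. well above 1
```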
plt.style.use('seaborn-notebook')
display(pd.DataFrame(dbs,index=rng).plot(xlabel='Number of Clusters',ylabel='DBS'))
display(pd.DataFrame(dbs,index=rng)[['df_simple']].plot(xlabel='Number of Clusters',ylabel='DBS'))
<AxesSubplot:xlabel='Number of Clusters', ylabel='DBS'>
<AxesSubplot:xlabel='Number of Clusters', ylabel='DBS'>
Apparently the fewer the clusters the better?
A higher Calinski-Harabasz score relates to a model with better-defined clusters.
The index is the ratio of between-cluster dispersion to within-cluster dispersion, summed over all clusters.
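The Calinski-Harabasz score moves the opposite way on the same kind of toy data (synthetic, for intuition only):

```python
import numpy as np
from sklearn.metrics import calinski_harabasz_score

rnd = np.random.default_rng(1)
labels = np.array([0] * 50 + [1] * 50)

# Tight, well-separated blobs vs. wide, overlapping ones
tight = np.vstack([rnd.normal(0, 0.1, (50, 2)), rnd.normal(5, 0.1, (50, 2))])
loose = np.vstack([rnd.normal(0, 2.0, (50, 2)), rnd.normal(1, 2.0, (50, 2))])

chs_tight = calinski_harabasz_score(tight, labels)
chs_loose = calinski_harabasz_score(loose, labels)
print(chs_tight, chs_loose)  # well-separated blobs score far higher
```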
plt.style.use('seaborn-notebook')
display(pd.DataFrame(chs,index=rng).plot(xlabel='Number of Clusters',ylabel='CHS'))
display(pd.DataFrame(chs,index=rng)[['df_all_scaled']].plot(xlabel='Number of Clusters',ylabel='CHS'))
display(pd.DataFrame(chs,index=rng)[['df_lbl_enc_scaled']].plot(xlabel='Number of Clusters',ylabel='CHS'))
<AxesSubplot:xlabel='Number of Clusters', ylabel='CHS'>
<AxesSubplot:xlabel='Number of Clusters', ylabel='CHS'>
<AxesSubplot:xlabel='Number of Clusters', ylabel='CHS'>
We've used MiniBatchKMeans to get an overall picture, since it trains much faster without losing much precision. Given the observations above, we'll now use KMeans on the simple dataframe with 4 clusters, to get the most precision we can. Yes, I know it's "the Big 5", so why not 5 clusters? Well, the data is saying otherwise; who is to say the Big 5 is not actually the Big 4?
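A rough sketch of that speed/precision trade-off on synthetic blobs (sizes and seeds here are arbitrary):

```python
import numpy as np
from sklearn.cluster import KMeans, MiniBatchKMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=5000, centers=4, random_state=0)

km = KMeans(n_clusters=4, n_init=10, random_state=0).fit(X)
mbk = MiniBatchKMeans(n_clusters=4, n_init=10, random_state=0).fit(X)

# MiniBatchKMeans fits on small random batches, so it's much faster on
# large data at the cost of a slightly worse (higher) inertia
print(km.inertia_, mbk.inertia_)
```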
clusterer = KMeans(n_clusters=4, **kmeans_kwargs)
clusterer.fit(df_simple)
labels = clusterer.labels_
df_cl = df_simple.copy()
df_cl['cluster'] = labels
def boxplots(df_cl, cl_col):
    cl_nm = len(np.unique(df_cl[cl_col]))
    for col in df_cl.drop(columns=[cl_col]).columns.values:
        # one series of values per cluster
        data = []
        for i in range(cl_nm):
            data.append(df_cl[df_cl[cl_col] == i][col])
        fig = plt.figure(figsize=(10, 7))
        ax = fig.add_subplot(111)
        # horizontal, notched boxplots with filled boxes
        bp = ax.boxplot(data, patch_artist=True, notch=True, vert=0)
        # only the first four boxes get a custom face color
        colors = ['#0000FF', '#00FF00', '#FFFF00', '#FF00FF']
        for patch, color in zip(bp['boxes'], colors):
            patch.set_facecolor(color)
        # style the whiskers, caps, medians and fliers
        for whisker in bp['whiskers']:
            whisker.set(color='#8B008B', linewidth=1.5, linestyle=':')
        for cap in bp['caps']:
            cap.set(color='#8B008B', linewidth=2)
        for median in bp['medians']:
            median.set(color='red', linewidth=3)
        for flier in bp['fliers']:
            flier.set(marker='D', color='#e7298a', alpha=0.5)
        # y-axis labels, one per cluster
        yticks = ['cluster {}'.format(i) for i in range(cl_nm)]
        ax.set_yticklabels(yticks)
        plt.title(col)
        # only keep the bottom and left axis ticks
        ax.get_xaxis().tick_bottom()
        ax.get_yaxis().tick_left()
        plt.show()
boxplots(df_cl,'cluster')
Let's check our guess on observation nº 2
df_cl['cluster'].value_counts()
0    357937
1    120737
3    115597
2    100954
Name: cluster, dtype: int64
It seems we were right, cluster 0 has over 50% of the data. Might it have US observations?
for i in range(4):
print('Cluster {} top 3:'.format(i), df_cl[df_cl['cluster']==i]['country'].value_counts()[0:3], sep='\n')
Cluster 0 top 3:
206    345069
218      2830
200      2421
Name: country, dtype: int64
Cluster 1 top 3:
35    43801
12    34631
51    12309
Name: country, dtype: int64
Cluster 2 top 3:
160    11126
142     9777
151     9606
Name: country, dtype: int64
Cluster 3 top 3:
69    49744
94    12207
90     5604
Name: country, dtype: int64
cntry = df[['country']].copy()
cntry['enc'] = LabelEncoder().fit_transform(df['country'])
print("Cluster 0's top country is " + pd.unique(cntry[cntry['enc']==206]['country'])[0]
      , "Cluster 1's top country is " + pd.unique(cntry[cntry['enc']==35]['country'])[0]
      , "Cluster 2's top country is " + pd.unique(cntry[cntry['enc']==160]['country'])[0]
      , "Cluster 3's top country is " + pd.unique(cntry[cntry['enc']==69]['country'])[0]
      , sep='\n')
Cluster 0's top country is US
Cluster 1's top country is CA
Cluster 2's top country is PH
Cluster 3's top country is GB
Another observation: cluster 1 seems the most evenly distributed across countries.
pca = PCA(n_components=2)
pca_fit = pca.fit_transform(df_simple)
df_pca = pd.DataFrame(data=pca_fit, columns=['PCA1', 'PCA2'])
df_pca['Clusters'] = labels
df_pca.head()
| PCA1 | PCA2 | Clusters | |
|---|---|---|---|
| 0 | 77.446001 | -5.779149 | 3 |
| 1 | 3.441659 | 0.155677 | 2 |
| 2 | 77.445594 | -0.931190 | 3 |
| 3 | 77.460765 | 0.808060 | 3 |
| 4 | -30.551297 | -2.700953 | 2 |
plt.figure(figsize=(10,10))
sb.scatterplot(data=df_pca, x='PCA1', y='PCA2', hue='Clusters', palette='Set2', alpha=0.8)
plt.title('Personality Clusters after PCA');
That doesn't look so good though...
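One likely culprit: two principal components may keep only a small slice of the total variance, in which case the 2D scatter hides most of the structure. A quick check worth running (sketched here on synthetic stand-in data; in the notebook you would inspect the pca object fitted on df_simple):

```python
import numpy as np
from sklearn.decomposition import PCA

rnd = np.random.default_rng(0)
X = rnd.normal(size=(1000, 50))  # stand-in for the 50 answer columns

pca = PCA(n_components=2).fit(X)
# Fraction of total variance the 2D projection keeps; when this is small,
# the scatter plot is a poor summary of the clustering
print(pca.explained_variance_ratio_.sum())
```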
Let's rerun the metric sweep, just with df_simple this time around (and without the country column).
sse = {'df_simple': []}  # SSE values for each k
dbs = {'df_simple': []}  # Davies-Bouldin scores for each k
chs = {'df_simple': []}  # Calinski-Harabasz scores for each k
X = df_simple.drop(columns=['country'])
for k in tqdm(rng):
    clusterer = MiniBatchKMeans(n_clusters=k, **kmeans_kwargs)
    clusterer.fit(X)
    labels = clusterer.labels_
    sse['df_simple'].append(clusterer.inertia_)
    dbs['df_simple'].append(davies_bouldin_score(X, labels))
    chs['df_simple'].append(calinski_harabasz_score(X, labels))
print('df_simple done')
100%|██████████████████████████████████████████████████████████████████████████████████| 16/16 [01:04<00:00,  4.01s/it]
df_simple done
plt.style.use('seaborn-notebook')
print('Low is good')
display(pd.DataFrame(sse['df_simple'],index=rng).plot(xlabel='Number of Clusters',ylabel='SSE'))
Low is good
<AxesSubplot:xlabel='Number of Clusters', ylabel='SSE'>
plt.style.use('seaborn-notebook')
print('Low is good')
display(pd.DataFrame(dbs['df_simple'],index=rng).plot(xlabel='Number of Clusters',ylabel='DBS'))
Low is good
<AxesSubplot:xlabel='Number of Clusters', ylabel='DBS'>
plt.style.use('seaborn-notebook')
print('High is good')
display(pd.DataFrame(chs['df_simple'],index=rng).plot(xlabel='Number of Clusters',ylabel='CHS'))
High is good
<AxesSubplot:xlabel='Number of Clusters', ylabel='CHS'>
Okay, this time let's assume "The Big 5" holds, run KMeans with 5 clusters, and see if we can spot each cluster's characteristics.
clusterer = KMeans(n_clusters=5, **kmeans_kwargs)
clusterer.fit(df_simple.drop(columns=['country']))
labels = clusterer.labels_
df_cl = df_simple.drop(columns=['country']).copy()
df_cl['cluster'] = labels
boxplots(df_cl,'cluster')
That looks much better! There's a lot more variance between the boxes, suggesting each cluster is more uniquely characterized.
pca = PCA(n_components=2)
pca_fit = pca.fit_transform(df_simple.drop(columns=['country']))
df_pca = pd.DataFrame(data=pca_fit, columns=['PCA1', 'PCA2'])
df_pca['Clusters'] = labels
df_pca.head()
| PCA1 | PCA2 | Clusters | |
|---|---|---|---|
| 0 | -5.582889 | -1.514488 | 1 |
| 1 | 0.136778 | 3.014162 | 0 |
| 2 | -0.762445 | 2.069605 | 0 |
| 3 | 1.000176 | 0.085359 | 2 |
| 4 | -2.774470 | 2.415694 | 1 |
plt.figure(figsize=(10,10))
sb.scatterplot(data=df_pca, x='PCA1', y='PCA2', hue='Clusters', palette='Set2', alpha=0.8)
plt.title('Personality Clusters after PCA');
Beautiful!!
There are more things we could try, like using neural networks to find the clusters. But I believe this dataset has its limitations, and we'll leave those adventures for more interesting datasets. This was a simple exercise in data analysis and some clustering practice.
As a conclusion, it seems 5 clusters do work best for "The Big 5"; nothing wrong with some confirmatory evidence, right?